An RNN-based prosodic information synthesizer for Mandarin text-to-speech

Authors

Sin-Horng Chen

Shaw-Hwa Hwang

Yih-Ru Wang

Abstract

A new RNN-based prosodic information synthesizer for Mandarin Chinese text-to-speech (TTS) is proposed in this paper. Its four-layer recurrent neural network (RNN) generates prosodic information such as syllable pitch contours, syllable energy levels, syllable initial and final durations, as well as intersyllable pause durations. The input layer and first hidden layer operate with a word-synchronized clock to represent current-word phonologic states within the prosodic structure of text to be synthesized. The second hidden layer and output layer operate on a syllable-synchronized clock and use outputs from the preceding layers, along with additional syllable-level inputs fed directly to the second hidden layer, to generate desired prosodic parameters. The RNN was trained on a large set of actual utterances accompanied by associated texts, and can automatically learn many human-prosody phonologic rules, including the well-known Sandhi Tone 3 F0-change rule. Experimental results show that all synthesized prosodic parameter sequences matched quite well with their original counterparts, and a pitch-synchronous-overlap-add-based (PSOLA-based) Mandarin TTS system was also used for testing of our approach. While subjective tests are difficult to perform and remain to be done in the future, we have carried out informal listening tests by a significant number of native Chinese speakers and the results confirmed that all synthesized speech sounded quite natural.
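The two-clock arrangement described in the abstract can be illustrated with a minimal forward-pass sketch in Python. This is not the authors' implementation. The class name, layer sizes, tanh activations, random weights, and the choice to place recurrent connections on both hidden layers are all assumptions made for illustration, since the abstract specifies none of them. The sketch only shows the first hidden layer updating once per word and the second hidden layer updating once per syllable, with the syllable-level inputs fed directly to the second hidden layer.

```python
import numpy as np


class TwoClockProsodyRNN:
    """Hypothetical four-layer network: input -> hidden 1 (word clock)
    -> hidden 2 (syllable clock) -> output. All sizes are illustrative."""

    def __init__(self, word_dim, syl_dim, h1_dim, h2_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)

        def init(rows, cols):
            # Small random weights; a trained model would learn these.
            return rng.normal(0.0, 0.1, size=(rows, cols))

        self.W_in = init(h1_dim, word_dim)   # word-level input -> hidden 1
        self.W_h1 = init(h1_dim, h1_dim)     # hidden 1 recurrence (assumed)
        self.W_12 = init(h2_dim, h1_dim)     # hidden 1 -> hidden 2
        self.W_syl = init(h2_dim, syl_dim)   # syllable-level input -> hidden 2
        self.W_h2 = init(h2_dim, h2_dim)     # hidden 2 recurrence (assumed)
        self.W_out = init(out_dim, h2_dim)   # hidden 2 -> prosodic outputs
        self.h1_dim = h1_dim
        self.h2_dim = h2_dim

    def forward(self, words):
        """words: list of (word_features, [syllable_features, ...]).
        Returns one output vector per syllable."""
        h1 = np.zeros(self.h1_dim)
        h2 = np.zeros(self.h2_dim)
        outputs = []
        for word_x, syllables in words:
            # Word-synchronized clock: one update per word.
            h1 = np.tanh(self.W_in @ word_x + self.W_h1 @ h1)
            for syl_x in syllables:
                # Syllable-synchronized clock: one update per syllable,
                # reusing the current word state h1.
                h2 = np.tanh(self.W_12 @ h1 + self.W_syl @ syl_x + self.W_h2 @ h2)
                outputs.append(self.W_out @ h2)
        return np.array(outputs)


# Usage with made-up feature sizes: two words with 2 and 3 syllables.
model = TwoClockProsodyRNN(word_dim=6, syl_dim=4, h1_dim=5, h2_dim=7, out_dim=8)
rng = np.random.default_rng(1)
sentence = [
    (rng.normal(size=6), [rng.normal(size=4) for _ in range(2)]),
    (rng.normal(size=6), [rng.normal(size=4) for _ in range(3)]),
]
prosody = model.forward(sentence)
```

Each row of `prosody` stands in for the prosodic parameters of one syllable (pitch contour, energy level, initial and final durations, and the following pause duration). The output size of 8 is an arbitrary choice here, not a value taken from the paper.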


Similar resources

An RNN-based spectral information generation for Mandarin text-to-speech

In this paper, an RNN-based spectral model is proposed to generate spectral parameters for Mandarin text-to-speech (TTS). The RNN is employed to learn the relations between the linguistic features and the spectral parameters. The phoneme-to-spectral parameter rules and the coarticulation rules between each pair of adjacent phones are automatically learned and stored in the weights of the RNN. The sy...


RNN-based prosodic modeling for mandarin speech and its application to speech-to-text conversion

In this paper, a recurrent neural network (RNN) based prosodic modeling method for Mandarin speech-to-text conversion is proposed. The prosodic modeling is performed in the post-processing stage of acoustic decoding and aims at detecting word-boundary cues to assist in linguistic decoding. It employs a simple three-layer RNN to learn the relationship between input prosodic features, extracted f...


A Corpus-Based Prosodic Modeling Method for Mandarin and Min-Nan Text-to-Speech Conversions

This talk gives an introduction to a recurrent neural network (RNN) based prosody synthesis method for both Mandarin and Min-Nan text-to-speech (TTS) conversions. The method uses a four-layer RNN to model the dependency of output prosodic information on input linguistic information. The main advantages of the method are the capability of learning many human prosody pronunciation rules automaticall...


High-Quality Prosody Generation in Mandarin Text-to-Speech System

A text-to-speech (TTS) synthesizer is a computer-based system that can automatically read text aloud. Fujitsu is developing a Mandarin TTS system using state-of-the-art technologies. The prosodic structure of synthesized text provides important information for making synthetic speech produced by a TTS system more natural and understandable. This paper describes a global probability estimation m...


An NN-based Approach to Prosodic for Synthesizing English Words Em

In this paper, a neural network-based approach to generating proper prosodic information for spelling/reading English words embedded in background Chinese texts is discussed. It expands an existing RNN-based prosodic information generator for Mandarin TTS to an RNN-MLP scheme for Mandarin-English mixed-lingual TTS. It first treats each English word as a Chinese word and uses the RNN, trained fo...



Journal:

IEEE Trans. Speech and Audio Processing

Volume 6

Publication date: 1998
